
Conversation

@Skn0tt Skn0tt commented Dec 19, 2025

Inspired by #38598, alternative to #38588. Let's play around with it and see how it works for our own CI.



@pavelfeldman pavelfeldman left a comment


LGTM modulo CLI flag.

- type: <[null]|[Object]>
  - `total` <[int]> The total number of shards.
  - `current` <[int]> The index of the shard to execute, one-based.
  - `weights` ?<[Array]<[int]>> The shard weights.

Let's not expose them for now.
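(For context, the fields documented above map onto the existing `shard` option in `playwright.config.ts` roughly as sketched below; `total` and `current` exist today, while `weights` is the addition proposed in this PR and, per the comment above, is not being exposed in config for now.)

```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Existing shape today: run the first of four shards.
  shard: { total: 4, current: 1 },

  // Proposed addition from this PR's docs (per the comment above, not
  // exposed in config for now): relative weights per shard.
  // shard: { total: 4, current: 1, weights: [26, 24, 25, 25] },
});
```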

## Rebalancing Shards

```
npx playwright test --shard=1/4,26/24/25/25
```

I'd say `--shard=1/4 --shard-weights=26:24:25:25` to keep things readable


@github-actions

Test results for "MCP"

2727 passed, 116 skipped


Merge workflow run.

@Skn0tt Skn0tt merged commit cb93c44 into microsoft:main Dec 23, 2025
30 of 31 checks passed
@github-actions

Test results for "tests 1"

1 failed
❌ [playwright-test] › runner.spec.ts:124 › should ignore subprocess creation error because of SIGINT @macos-latest-node20-2

1 flaky ⚠️ [firefox-library] › library/inspector/cli-codegen-1.spec.ts:1082 › cli codegen › should not throw csp directive violation errors `@firefox-ubuntu-22.04-node20`

34396 passed, 689 skipped


Merge workflow run.


sk33wiff commented Jan 7, 2026

@Skn0tt what release will include these changes?


Skn0tt commented Jan 8, 2026

We're planning for 1.58 to include this.


gpaciga commented Jan 12, 2026

I don't see how this solves the distribution problem raised in the various referenced issues. Consider an unbalanced suite with 1000 tests that take 1s each (~17 minutes total) and two tests that each take 20 minutes, but are unluckily the last tests in the list. I don't think I can use weights to evenly balance this into 2 shards. If the shards are weighted to split up the two 20-minute tests, the first shard will be 37 minutes and the second 20 minutes. If the two are in the same shard, the first shard will be at most 17 minutes and the second will be at least 40 minutes.

So, this is not actually a viable alternative to #38588
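(To make the arithmetic in the comment above concrete, here is a small illustrative sketch; the durations and splits come from the comment, and splitting the ordered test list by counts is an assumption about how weight-based sharding would divide the suite.)

```ts
// Illustrative only: 1000 tests of 1 second each, followed by two 20-minute tests.
const durations: number[] = [...Array(1000).fill(1), 1200, 1200]; // seconds

// Weight-based sharding (as I read the proposal) splits the ordered test list
// by counts; it doesn't know how long each test takes.
function splitByCounts(tests: number[], counts: number[]): number[][] {
  const shards: number[][] = [];
  let offset = 0;
  for (const count of counts) {
    shards.push(tests.slice(offset, offset + count));
    offset += count;
  }
  return shards;
}

const minutes = (shard: number[]) =>
  Math.round(shard.reduce((sum, d) => sum + d, 0) / 60);

// Weights that split the two slow tests apart: ~37 min vs 20 min.
console.log(splitByCounts(durations, [1001, 1]).map(minutes)); // [37, 20]

// Weights that keep them together: ~17 min vs 40 min.
console.log(splitByCounts(durations, [1000, 2]).map(minutes)); // [17, 40]
```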


gerardaz commented Jan 12, 2026

+1 to the above – I'm not convinced this really solves the problem, or if it does, only in a way which doesn't seem very user friendly.

If I'm not mistaken, this just means you can control how many tests are assigned to each shard? So you need to evaluate ahead of time how the tests will bucket, right?

It won't affect the ordering of them. If you have 6 heavy tests in direct sequence (e.g. 10 minutes each) while most of the rest of the 300 tests take seconds, it doesn't seem like there is any conceivable way to come close to balancing them with this implementation, short of having nearly 300 shards.

I'm not sure I understand the scenarios that this is solving for though, so maybe this is my own misunderstanding or lack of imagination.

Doesn't this only really address cases where the tests have a sort of normal distribution of length? So you'd compute (how I'm not sure) where to redraw the boundaries to rebalance them?

My situation, which I gather from @gpaciga's comment above is not unique, is that we have a roughly even distribution of runtimes, with a small number of clustered extreme outliers. If I have 6 outliers in a suite of 300 tests, with 6 shards, the optimal distribution is that none of these six runs in the same shard as any of the others (or close to it). But this weighting wouldn't appear to make that possible – and even if it did, wouldn't new tests distort the distribution?


muhqu commented Jan 12, 2026

@Skn0tt great to see sharding getting some attention!

However, I wonder why this shard-weights approach is considered an alternative to sharding based on duration/timing data?


Skn0tt commented Jan 13, 2026

That's good feedback, thanks. We were thinking that weights are easier to maintain than duration data, but you're right that for uneven scenarios this doesn't solve the ordering problem, whereas something like greedy bin-packing based on duration data would. I'll take this feedback back to the drawing board.
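(For illustration, a minimal sketch of the greedy bin-packing idea mentioned above, i.e. the longest-processing-time heuristic: assign each test, longest first, to the currently least-loaded shard. This is not Playwright's API, just the general technique.)

```ts
interface TestEntry {
  id: string;
  duration: number; // e.g. taken from a previous run's timing data
}

// Sort tests by duration (descending) and always place the next test on the
// shard with the smallest total duration so far.
function binpack(tests: TestEntry[], shardCount: number): TestEntry[][] {
  const shards: TestEntry[][] = Array.from({ length: shardCount }, () => []);
  const loads: number[] = new Array(shardCount).fill(0);
  for (const test of [...tests].sort((a, b) => b.duration - a.duration)) {
    const target = loads.indexOf(Math.min(...loads));
    shards[target].push(test);
    loads[target] += test.duration;
  }
  return shards;
}
```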


muhqu commented Jan 13, 2026

@Skn0tt the shard weights as implemented in this PR help when your shards always run on the same machines and those machines have different performance specs. E.g. shard 1/4 runs on a high-performing machine and can therefore take more weight than the other shards.

However, if you utilize cloud infrastructure to run your tests it's very likely that all shards run on machines that have equal performance specs.

It would be great if you could give each test its own weight and have the sharding automatically distribute tests based on these per-test weights... wdyt?


Skn0tt commented Jan 13, 2026

I see how shard weights can help balance across different machine types, but shards typically all use the same specs. Weights can still help, because sharding is stable. The same test always lands in the same shard, and if a significant number of expensive tests happens to land in one shard, that shard will take longer.

That's what's happening in our own tests: we have roughly 40k tests across different bots and 2 shards, and we found that shard 2 always took longer because its tests are just slightly longer, and that compounds at scale. So attributing more weight to shard 1 allows us to iron out the imbalance, see #38635 for numbers.

This works best in large test suites where the imbalance compounds and the number of tests is stable. It's not as effective in the case outlined above, where a small number of disproportionately expensive tests all land in the same shard.

#38588 is the alternative that solves both use cases, but we were concerned about the usability of committing duration stats into the repo. We'll reevaluate.


muhqu commented Jan 13, 2026

@Skn0tt There is no need to commit the durations to the repository. We're also not doing that. They just serve as a way to optimize overall CI time.

In our setup (still using the changes from #30962, for 2 years already) we keep the duration/timing data in last-run.json, which we also do not check into source control. Our CI archives last-run.json for every main build, and for every build we restore the last successful last-run.json, which gives us an optimal distribution of tests across shards.


Skn0tt commented Jan 13, 2026

I see! What CI runner are you using, and how are you restoring the file across builds?


muhqu commented Jan 13, 2026

We're using Jenkins Pipeline. The Archive step allows us to make the last-run.json available to other pipelines...

However, I'm pretty sure you can use a post-build step in any other CI solution to store the test durations from the last successful main build somewhere you can then use for the next build. E.g. you could even post it as a gist or upload it to an AWS S3 bucket...


gpaciga commented Jan 13, 2026

FWIW I've been using the same approach as @muhqu for approximately 6 months, on GitHub Actions, archiving last-run.json in an S3 bucket. The file is downloaded at the start of each run. The results from the main branch are used the first time a new PR runs, since it won't have its own timing data yet.
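(For anyone wanting to copy this setup, here is a hedged sketch of the restore/archive half using the AWS SDK for JavaScript; the bucket name, object key, and the `test-results/.last-run.json` path are assumptions, and whether that file carries per-test timing data depends on #30962-style changes rather than stock Playwright.)

```ts
import { S3Client, GetObjectCommand, PutObjectCommand } from '@aws-sdk/client-s3';
import { readFile, writeFile } from 'node:fs/promises';

const s3 = new S3Client({});
const Bucket = 'my-ci-artifacts';                 // hypothetical bucket
const Key = 'playwright/last-run.json';           // hypothetical object key
const localPath = 'test-results/.last-run.json';  // assumed output location

// Before the run: restore timings from the last successful main build, if any.
export async function restoreLastRun(): Promise<void> {
  try {
    const res = await s3.send(new GetObjectCommand({ Bucket, Key }));
    await writeFile(localPath, await res.Body!.transformToString());
  } catch {
    // No archived timings yet (e.g. first run): fall back to default sharding.
  }
}

// After a successful main-branch run: archive the timings for future builds.
export async function archiveLastRun(): Promise<void> {
  const body = await readFile(localPath, 'utf8');
  await s3.send(new PutObjectCommand({ Bucket, Key, Body: body }));
}
```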
